ECEN 4532: DSP Laboratory
Professor Michael Perkins
Lab 1
Jesse Tao
February 5, 2018
Contents
1) Introduction
2) Audio signal: from the sound to the wav file
2.1) Segments, Frames, Samples
3) The music
4) Low level features: time domain analysis
4.1) Loudness
4.2) Zero-crossing
5) Low level features: frequency domain analysis
5.1) Windowing and spectral leakage
5.2) Spectral centroid and spread
5.3) Spectral Flatness
5.4) Spectral Flux
5.5) Application: MPEG-7 Low Level Audio Descriptors
6) Basic Psychoacoustic Quantities
6.1) Psychoacoustics
6.2) Perception of frequencies
6.3) The mel/Bark scale
6.4) The cochlear filterbank
7) Loudness and Spectral Figures
1) Introduction
With the wide availability of digital music, there is a growing need for tools that can search and organize large music collections. Companies already offer applications that can recognize a song from the audio alone.
The goal of this lab is to explore digital signal processing techniques that automatically extract features characterizing a piece of music.
2) Audio signal: from the sound to the wav file
The current standard for high quality audio and DVD is a sampling frequency of f_s = 96 kHz with 24-bit quantization. CD quality is 16-bit samples at f_s = 44.1 kHz.
The Nyquist theorem states that in order to accurately capture content up to a maximum frequency of f_s/2, we must sample at a frequency of f_s. Since dolphins hear frequencies up to 120 kHz, this means we would need a sampling frequency of 240 kHz. As stated above, DVDs use a sampling frequency of 96 kHz, which will not work for dolphins. While there may be other technologies that sample at 240 kHz, they are likely not cost effective, so it is not viable to share the sounds of dolphins with the typical consumer until better recording standards are developed.
2.1) Segments, Frames, Samples
In order to automatically analyze the music, the audio files are segmented into non-overlapping
segments of a few seconds. This provides the coarsest resolution for the analysis. The audio
features are computed at a much finer resolution. The audio file is divided into frames of a few
milliseconds over which the analysis is conducted. In some cases the frames may be chosen to
overlap, whereas in other cases they will not.
In this lab the audio files are sampled at f_s = 11,025 Hz, and we consider intervals of size N = 512 samples, or about 46 ms. This interval of 46 milliseconds is called a frame. We will implement several algorithms that return a set of features for each frame. While there are 512 samples per frame, we will usually return a feature vector of much smaller size.
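The frame bookkeeping above can be checked numerically. This Python sketch assumes the lab's values of f_s = 11025 Hz, N = 512, and a 24-second excerpt with half-overlapping frames:

```python
f_s = 11025                # sampling rate used in this lab (Hz)
N = 512                    # samples per frame
frame_ms = 1000 * N / f_s
print(round(frame_ms, 1))  # 46.4 ms per frame

# number of half-overlapping frames that fit in a 24 s excerpt (no padding)
T = 24
hop = N // 2
n_frames = 1 + (T * f_s - N) // hop
print(n_frames)            # 1032
```

Note that MATLAB's buffer pads the last frame with zeros, so its frame count can be slightly higher than this no-padding count.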
Assignment
1. Dolphins can only hear sounds over the frequency range [7-120] kHz. At what sampling frequency f_s should we sample digital audio signals for dolphins?
3) The music
The music being analyzed comes from six different musical genres, chosen because the tracks are diverse and have interesting characteristics.
MATLAB function:
function [soundExtract,p] = extractSound(filename, time)
info = audioinfo(filename);
[song, ~] = audioread(filename);
if time >= info.Duration
soundExtract = song;
p = audioplayer(soundExtract,info.SampleRate);
return;
elseif time <= 1/info.SampleRate
error('Too small of a time to sample');
end
samples = time*info.SampleRate;
soundExtract = song(floor(info.TotalSamples/2)-floor(samples/2):1: ...
floor(info.TotalSamples/2)+floor(samples/2));
p = audioplayer(soundExtract, info.SampleRate);
end
The MATLAB function uses the built-in audioinfo function, which provides the metadata (sample rate, duration, total samples) needed to extract the requested number of samples from any .wav file.
4) Low level features: time domain analysis
Assignment
2. Write a MATLAB function that extracts T seconds of music from a given track. You will
use the MATLAB function audioread to read a track and the functions audioplayer and play
(or playblocking) to listen to the track. In the lab you will use T =24 seconds and start at the
sample halfway through each track to compare the different algorithms. Download the files,
and test your function.
Assignment
3. Implement the loudness and ZCR and evaluate these features on the different music
tracks. The frames should be non overlapping. Your MATLAB function should display each
feature as a time series in a separate figure (see e.g. Fig. 1).
4. Comment on the specificity of the feature, and its ability to separate different musical genres.
The most interesting features involve more sophisticated spectral analysis. There are, however, a few simple features that can be computed directly in the time domain; we describe some of the most popular ones in the following.
4.1) Loudness
The standard deviation of the original audio signal x[n] computed over a frame of size N provides a
sense of the loudness,

$$\sigma(n) = \sqrt{\frac{1}{N-1}\sum_{m=n-N+1}^{n}\left(x[m]-\bar{x}_n\right)^2},$$

where $\bar{x}_n$ is the mean of the frame ending at sample n.
The following is the loudness function, which uses the extractSound function from before and splits the song into frames of size 512, each overlapping the previous frame by half:
function [loudness_data] = loudness(filename, varargin)
%Computes the standard deviation of the original audio divided
%into frames of size 512: loudness(filename,[frameSize,time,plotBool])
switch nargin
case 1
frameSize=512;
time=24;
plotBool=1;
case 2
frameSize=varargin{1};
time=24;
plotBool=1;
case 3
frameSize=varargin{1};
time=varargin{2};
plotBool=1;
case 4
frameSize=varargin{1};
time=varargin{2};
plotBool=varargin{3};
otherwise
error('loudness:argChk', 'Wrong number of input arguments');
end
[y,~]=extractSound( filename, time );
frames_data = buffer(y,frameSize,ceil(frameSize/2));
loudness_data=std(frames_data,0,1);
if plotBool==1
p=plot(1:length(loudness_data),loudness_data(1,:));
title(['Loudness per frame: "' filename '"']);
xlim([0 length(loudness_data)+1]);
xlabel('Frame Number');
ylabel('Loudness');
saveas(p,['Loudness_' filename(1:end-4) '.png']);
end
end
The following is a plot of the loudness of “track201-classical.wav”. The figure shows that the song starts off loud and then becomes quiet, with two fairly loud sections in between and a very soft ending. Other loudness plots can be found at the end of the report in the figures section.
The loudness feature seems helpful for identifying jazz and classical music. Classical music does not seem to swing much between high and low amplitudes, while jazz is strongly periodic, with regular spikes separated by long stretches of little change.
4.2) Zero-crossing
The Zero Crossing Rate is the average number of times the audio signal crosses the zero
amplitude line per time unit. The ZCR is related to the pitch height, and is also correlated to the
noisiness of the signal. We use the following definition of ZCR,

$$ZCR(n) = \frac{1}{2(N-1)}\sum_{m=n-N+2}^{n}\left|\operatorname{sign}(x[m]) - \operatorname{sign}(x[m-1])\right|.$$
Note that to ensure that each frame’s ZCR rate can be computed independently of that of other
frames, the lower index in the summation is one larger than it is for the standard deviation.
Below is my MATLAB implementation of the zero-crossing rate. Similar to the loudness function, it takes the middle 24 seconds of a track and buffers it into frames of size 512.
function [ ZCR_data ] = zeroCross( filename )
%zeroCross Computes the zero cross rate of an audio file
% The Zero Crossing Rate is the average number of times the audio signal
% crosses the zero amplitude line per time unit
frameSize=512;
time=24;
[y,~]=extractSound( filename, time ); % Operate on middle 24 seconds
frames_data = buffer(y,frameSize,ceil(frameSize/2));
% Count sign changes between consecutive samples in each frame
ZCR_data = sum(abs(diff(sign(frames_data))),1)/(2*(frameSize-1));
p=plot(1:length(ZCR_data),ZCR_data(1,:));
title(['ZCR per frame: "' filename '"']);
xlim([0 length(ZCR_data)+1]);
xlabel('Frame Number');
ylabel('ZCR per frame');
saveas(p,['ZCR_' filename(1:end-4) '.png']);
end
This is a plot of the ZCR function applied to “track201-classical.wav”. The shape is quite similar to the loudness plot; however, the ZCR measures how noisy a track is instead. Other ZCR plots can be found in the figures section at the end of the report.
Just like loudness, the ZCR is useful for identifying classical and jazz music. Neither feature shows a clear advantage over the other; far more music would need to be tested to reveal the potential advantages of one over the other.
5) Low level features: frequency domain analysis
5.1) Windowing and spectral leakage
Music is made up of notes of different pitch. It is only natural that most of the automated analysis of
music should be performed in the spectral (frequency) domain.
Our goal is the reconstruction of a musical score. Our analysis requires N = 512 samples to compute notes over a frame. The spectral analysis proceeds as follows. Each frame is smoothly extracted by multiplying the original audio signal by a taper window w. The Fourier transform of the windowed signal is then computed. If x_n denotes a frame of size N = 512 extracted at frame n, and w is a window of size N, then the Fourier transform X_n (of size N) for frame n is given by

$$X_n(k) = \sum_{m=0}^{N-1} w[m]\,x_n[m]\,e^{-2\pi i k m/N}, \qquad k = 0,\dots,N-1. \qquad (3)$$

There are several simple descriptors that can be used to characterize the spectral information provided by the Fourier transform over a frame.
The Fourier transform of the cosine function is easy to compute. A cosine at frequency ω0 is the sum of two complex exponentials, and the Fourier transform of each exponential is simply a Dirac delta function offset by ±ω0. Thus the Fourier transform of the cosine signal (4) is

$$X(\omega) = \pi\left[\delta(\omega-\omega_0) + \delta(\omega+\omega_0)\right].$$
Using knowledge from the lecture class, we know that the Fourier transform of a product of two functions in the time domain is (up to a constant) the convolution of their Fourier transforms. We can also use the fact that multiplication by a complex exponential in the time domain corresponds to a frequency shift in the Fourier domain. Using these two properties, the Fourier transform of the windowed cosine y[n] = w[n] cos(ω0 n) is

$$Y(\omega) = \frac{1}{2\pi}\,(X * W)(\omega) = \frac{1}{2}\left[W(\omega-\omega_0) + W(\omega+\omega_0)\right].$$
Assignment
6. In practice, we work with a finite signal, and we multiply the signal x[n] by a window w[n]. We assume that the window w is non zero at times n = −N/2, . . . , N/2, and we define

$$y[n] = x[n]\,w[n].$$

Derive the expression of the Fourier transform of y[n] in terms of the Fourier transform of x and the Fourier transform of the window w.
In order to bridge the gap between a continuous-time perception of sound and a computer's discrete-time processing, a windowing function is used. Windowing lets us analyze a finite frame while suppressing the spurious high-frequency components (spectral leakage) that an abrupt truncation would introduce. Below is the function used to compute the windowed Fourier transform.
function [Xn] = windows( filename, window )
%windows Compute the windowed Fourier transform of each frame
frameSize=512;
if(nargin == 1)
window = 3;
end
switch window
case 1
w=bartlett(frameSize); % Bartlett window
case 2
w=hann(frameSize); % Hann (Hanning) window
case 3
w=kaiser(frameSize,0.5); % Kaiser window
otherwise
w=kaiser(frameSize,0.5); % defaults to Kaiser window
end
[x,~]=extractSound(filename,24);
xn=buffer(x,frameSize,frameSize/2);
Y=zeros(size(xn));
for i=1:size(xn,2)
Y(:,i)=fft(xn(:,i).*w);
end
K=frameSize/2+1; % keep only the non-negative frequencies
Xn=Y(1:K,:);
end
Assignment
7. Implement the computation of the windowed Fourier transform of y, given by (3). Evaluate its performance with pure sinusoidal signals and different windows:
• Bartlett
• Hann
• Kaiser
8. Compute the spectrogram of an audio track as follows:
(a) Decompose a track into a sequence of N_f overlapping frames of size N. The overlap between two frames should be N/2.
(b) Compute the magnitude squared of the Fourier transform, |X_n(k)|^2, k = 1, . . . , K, over each frame n.
(c) Display the Fourier transform of all the frames in a matrix of size K × N_f.
The spectrogram should look like Fig. 2. (You may use the Matlab function spectrogram( ) with the ’power’ option. It does the required computation and presents the result in dB form.)
You will experiment with different audio tracks, as well as pure sinusoidal tones. Do the spectrograms look like what you hear?
In the rest of the lab, we will be using a Kaiser window to compute the Fourier transform, as explained in (3).
Below is the difference in performance between the windows; the Kaiser window appears to run the fastest, but only by about 70 ms.
In addition to how quickly each window runs, we can also see how effective each window is at extracting our signal using the spectrogram function. The spectrograms for each window are shown below.
[Spectrogram figures for the Bartlett, Hamming, and Kaiser windows applied to “track201-classical.wav”.]
As can be seen, the windows are quite similar and produce very similar power-to-frequency plots for the “track201-classical.wav” track. Below is the spectrogram code.
filename='track201-classical.wav';
[x,~]=audioread(filename);
info=audioinfo(filename);
nsc = 512;
nov = floor(nsc/2);
nff = max(256,2^nextpow2(nsc));
figure
spectrogram(x,bartlett(nsc),nov,nff,info.SampleRate,'yaxis');
title('Bartlett Window');
ax=gca;
saveas(ax,'BartlettWindow.png');
figure
spectrogram(x,hamming(nsc),nov,nff,info.SampleRate,'yaxis');
title('Hamming Window');
ax=gca;
saveas(ax,'HammingWindow.png');
figure
spectrogram(x,kaiser(nsc),nov,nff,info.SampleRate,'yaxis');
title('Kaiser Window');
ax=gca;
saveas(ax,'KaiserWindow.png');
close all;
5.2) Spectral centroid and spread
For these concepts we will treat the normalized magnitude of a spectral coefficient as if it were the probability that a particular frequency value occurs. In other words, for frame n, we have for the “probability” of frequency k

$$P_n(k) = \frac{|X_n(k)|}{\sum_{l=1}^{K} |X_n(l)|}. \qquad (7)$$
We then define the spectral centroid (the “center of mass”) of the spectrum as

$$C_n = \sum_{k=1}^{K} k\,P_n(k),$$

where we again think of k as the frequency.
The spectral spread for frame n is then the standard deviation given by

$$S_n = \sqrt{\sum_{k=1}^{K} (k - C_n)^2\,P_n(k)}.$$
The spectral centroid can be used to quantify sound sharpness or brightness. The spread quantifies the
width of the spectrum around the centroid, and thus helps differentiate between tone-like and noise-
like sounds.
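As a sanity check on these definitions (an illustrative NumPy sketch, separate from the MATLAB code later in this section), a spectrum with all its energy in a single bin should have its centroid at that bin and zero spread:

```python
import numpy as np

def centroid_spread(mag):
    """Spectral centroid and spread of one frame from its magnitude spectrum."""
    k = np.arange(1, len(mag) + 1)         # frequency bin indices
    p = mag / mag.sum()                    # P_n(k): normalized distribution
    c = np.sum(k * p)                      # centroid: "center of mass"
    s = np.sqrt(np.sum((k - c) ** 2 * p))  # spread: std. dev. around centroid
    return c, s

mag = np.zeros(257)
mag[99] = 1.0                              # all energy in bin k = 100
c, s = centroid_spread(mag)
print(c, s)                                # 100.0 0.0
```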
5.3) Spectral Flatness
Spectral flatness is the ratio between the geometric and arithmetic means of the magnitude of the Fourier transform,

$$SF_n = \frac{\left(\prod_{k=1}^{K} |X_n(k)|\right)^{1/K}}{\frac{1}{K}\sum_{k=1}^{K} |X_n(k)|}.$$
The flatness is always smaller than or equal to one, since the geometric mean is never larger than the arithmetic mean. The flatness equals one if all |X(k)| are equal, which happens for a very noisy signal. A very small flatness corresponds to the presence of tonal components. In summary, this is a measure of the noisiness of the spectrum.
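A quick NumPy check of the two limiting cases just described (illustrative only, not the lab's MATLAB code):

```python
import numpy as np

def spectral_flatness(mag):
    """Geometric mean over arithmetic mean of the magnitude spectrum."""
    geo = np.exp(np.mean(np.log(mag)))  # geometric mean computed via logs
    return geo / np.mean(mag)

flat = np.ones(257)                     # equal |X(k)|: noise-like spectrum
tonal = np.full(257, 1e-6)
tonal[10] = 1.0                         # one dominant tonal component

print(spectral_flatness(flat))          # 1.0
print(spectral_flatness(tonal) < 0.01)  # True: tonal spectrum is far from flat
```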
5.4) Spectral Flux
The spectral flux is a global measure of the spectral changes between two adjacent frames, n − 1 and n,

$$F_n = \sum_{k=1}^{K}\left(P_n(k) - P_{n-1}(k)\right)^2,$$

where P_n(k) is the normalized frequency distribution for frame n, given by (7).
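One useful consequence of using the normalized distribution P_n(k) is that the flux ignores overall level changes; only the shape of the spectrum matters. A small NumPy sketch (illustrative only):

```python
import numpy as np

def spectral_flux(mags):
    """mags: K x Nf matrix of magnitude spectra, one column per frame."""
    p = mags / mags.sum(axis=0)               # P_n(k) for every frame
    return np.sum(np.diff(p, axis=1) ** 2, axis=0)

rng = np.random.default_rng(0)
frame = rng.random(257)
mags = np.column_stack([frame, frame, 2 * frame])  # repeat, then rescale
flux = spectral_flux(mags)
print(flux)  # [0. 0.]: identical and merely-rescaled frames give zero flux
```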
Just as songs have low level time-domain characteristics, they also have low level frequency-domain characteristics. With modern DSP chips, computing an FFT is no longer a significant burden for DSP engineers. Below are plots of the frequency centroid, spread, flux, and flatness for the “track201-classical.wav” track. Plots for all other tracks can be found at the end of the report.
Assignment
9. Implement all the low level spectral features and evaluate them on the different tracks. Your MATLAB function should display each feature as a time series in a separate figure (e.g. see Fig. 3).
10. Comment on the specificity of the feature, and its ability to separate different musical genres.
[Plots of the spectral centroid, spread, flux, and flatness for “track201-classical.wav”.]
Again, these features are most useful in identifying classical and jazz music. Even though these are more sophisticated tools, they seem to provide about the same usefulness as the time-domain analyses. Below is the code used to generate these plots.
freqDist.m:
function [ Xk ] = freqDist( filename )
%freqDist Computes the frequency distribution of each frame
N=512;
song=extractSound(filename,24); % middle 24 seconds of the track
frames_overlap = buffer(song,N,N/2);
w=kaiser(N,0.5);
K=N/2+1;
Xk=zeros(K,size(frames_overlap,2));
for i = 1:size(frames_overlap,2)
Y=fft(w.*frames_overlap(:,i));
Xk(1:K,i)=Y(1:K);
end
end
centroidSpread.m:
function [ centroid,sigmaK ] = centroidSpread( filename )
%centroidSpread Computes the spectral centroid and spread per frame
% The spectral centroid can be used to quantify sound sharpness or
% brightness. The spread quantifies the spread of the spectrum around
% the centroid, and thus helps differentiate between tone-like and
% noise-like sounds.
close all;
Xk=freqDist(filename);
[a,b]=size(Xk);
% Normalized "probability" of each frequency bin per frame, eq. (7)
Pk=abs(Xk)./sum(abs(Xk),1);
k=(1:a).';
centroid=sum(k.*Pk,1); % center of mass of the spectrum
sigmaK=sqrt(sum(((k-centroid).^2).*Pk,1)); % spread around the centroid
x=1:b;
figure
plot(x,sigmaK);
title(['Frequency Spread: "' filename '"']);
xlabel('Frame Number');
ylabel('Spread');
xlim([0,b]);
saveas(gca,['freqSpread' filename(1:end-4) '.png']);
figure
plot(x,centroid);
title(['Frequency Centroid: "' filename '"']);
xlabel('Frame Number');
ylabel('Centroid');
xlim([0,b]);
saveas(gca,['freqCent' filename(1:end-4) '.png']);
close all;
end
specFlat.m:
function [ SFn ] = specFlat( filename )
%specFlat Computes the spectral flatness of a signal
% Spectral flatness is the ratio between the geometric and arithmetic
% means of the magnitude of the Fourier transform
Xk=freqDist(filename);
% Geometric mean via logs (avoids needing the Statistics Toolbox geomean)
SFn=exp(mean(log(abs(Xk)),1))./mean(abs(Xk),1);
h=figure;
plot(SFn);
title(['Spectral Flatness: "' filename '"']);
xlabel('Frame Number');
ylabel('Flatness');
xlim([0,length(SFn)]);
saveas(gca,['specFlat_' filename(1:end-4) '.png']);
close(h);
end
specFlux.m:
function [ Fn] = specFlux( filename)
%specFlux Measures how quickly power spectrum is changing
% The spectral flux is a global measure of the spectral changes between
% two adjacent frames, n-1 and n
Xk=freqDist(filename);
% Normalize each frame per eq. (7), then difference adjacent frames (dim 2)
Pk=abs(Xk)./sum(abs(Xk),1);
Fn=sum(diff(Pk,1,2).^2,1);
close all;
figure
plot(Fn);
title(['Spectral Flux: "' filename '"']);
xlabel('Frame Number');
ylabel('Flux');
xlim([0,length(Fn)]);
saveas(gca,['specFlux' filename(1:end-4) '.png']);
close all;
end
5.5) Application: MPEG-7 Low Level Audio Descriptors
MPEG-7, also known as “Multimedia Content Description Interface,” provides a standardized set of
technologies for describing multimedia content. Part 4 of the standard specifies description tools that
pertain to multimedia in the audio domain. The standard defines Low-level Audio Descriptors (LLDs) that
consist of a collection of simple, low complexity descriptors to characterize the audio content. Some of
the features are similar to the spectral features defined above.
6) Basic Psychoacoustic Quantities
In order to develop more sophisticated algorithms to analyze music based on its content, we need to
define several subjective features such as timbre, melody, harmony, rhythm, tempo, mood, lyrics, etc.
Some of these concepts can be defined formally, while others are more subjective and can be
formalized using a wide variety of different algorithms. We will focus on the features that can be defined
mathematically.
6.1) Psychoacoustics
Psychoacoustics involves the study of the human auditory system, and the formal quantification of
the relationships between the physics of sounds, and our perception of audio. We will describe some
key aspects of the human auditory system:
1. the perception of frequencies and pitch for pure and complex tones;
2. the frequency selectivity of the auditory system: our ability to perceive two similar frequencies as
distinct;
3. the modeling of the auditory system as a bank of auditory filters;
4. the perception of loudness;
5. the perception of rhythm.
6.2) Perception of frequencies
The auditory system, like the visual system, is able to detect frequencies over a wide range of scales. In
order to measure frequencies over a very large range, it operates using a logarithmic scale. Let us
consider a pure tone, modeled by a sinusoidal signal oscillating at a frequency f. If f < 500 Hz, then the perceived tone or pitch varies as a linear function of f. When f > 1,000 Hz, the perceived pitch increases logarithmically with f. Several frequency scales have been proposed to capture the logarithmic
scaling of frequency perception.
6.3) The mel/Bark scale
The Bark scale (named after the German physicist Barkhausen) is defined as

$$z = 13\arctan(0.00076\,f) + 3.5\arctan\!\left(\left(\frac{f}{7500}\right)^{2}\right),$$

where f is measured in Hz. The mel scale is defined by the fact that 1 bark = 100 mel. In this lab we will use a slightly modified version of the mel scale defined by

$$m = 1127.01048\,\ln\!\left(1 + \frac{f}{700}\right).$$

Note that, by construction, the m value corresponding to 1000 Hz is 1000 (the logarithm in the above formula is the natural log).
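The stated normalization (m = 1000 at f = 1000 Hz) is easy to verify; this sketch uses the same constant 1127.01048 that appears in the filterbank code later in the lab:

```python
import math

def hz_to_mel(f):
    # modified mel scale used in this lab: m = 1127.01048 * ln(1 + f/700)
    return 1127.01048 * math.log(1 + f / 700.0)

print(round(hz_to_mel(1000)))   # 1000, by construction of the constant
print(31 < hz_to_mel(20) < 32)  # True: mel_min (20 Hz) is about 31.7 mel
```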
6.4) The cochlear filterbank
Finally, we need to account for the fact that the auditory system behaves as a set of filters, i.e. as a
filterbank, whose filters have overlapping frequency responses. For each filter, the range of frequencies
over which the filter response is significant is called a critical band. A critical band is a band of audio
frequencies within which a second tone will interfere with the first via auditory masking.
Our perception of pitch can be quantified using the total energy at the output of each filter. All spectral
energy that falls into one critical band is summed up, leading to a single number for that band.
We describe in the following a simple model of the cochlear filterbank. The filter bank is constructed using N_B = 40 logarithmically spaced triangular filters centered at the frequencies Ω_p, which are implicitly defined by

$$1127.01048\,\ln\!\left(1 + \frac{\Omega_p}{700}\right) = \text{mel}_p.$$

The sequence of mel frequencies corresponding to the Ω_p are chosen to be equally spaced in the mel scale. Letting the indexing of the center frequencies of the filters start with p = 1, we have

$$\text{mel}_p = \text{mel}_{\min} + p\,\frac{\text{mel}_{\max} - \text{mel}_{\min}}{N_B + 1}, \qquad p = 1,\dots,N_B,$$

where N_B is the number of filters in the filterbank,

$$\text{mel}_{\min} = 1127.01048\,\ln\!\left(1 + \frac{20}{700}\right), \qquad \text{mel}_{\max} = 1127.01048\,\ln\!\left(1 + \frac{f_s/2}{700}\right),$$

and f_s is the sampling frequency of the audio being analyzed. As mentioned above, we’ll let N_B = 40. We see that mel_max is the m value corresponding to half the sampling frequency, which is the highest frequency preserved in the digital signal. We also see that mel_min corresponds to a frequency of 20 Hz. (These min and max values are not used as center frequencies for any filters; rather, they are the upper and lower frequency limits that are covered by the filters in the filter bank.)
Each filter H_p is centered around the frequency Ω_p and is defined by

$$H_p(f) = \begin{cases} 0, & f < \Omega_{p-1},\\ \dfrac{2\,(f - \Omega_{p-1})}{(\Omega_{p+1}-\Omega_{p-1})(\Omega_{p}-\Omega_{p-1})}, & \Omega_{p-1} \le f \le \Omega_{p},\\ \dfrac{2\,(\Omega_{p+1} - f)}{(\Omega_{p+1}-\Omega_{p-1})(\Omega_{p+1}-\Omega_{p})}, & \Omega_{p} \le f \le \Omega_{p+1},\\ 0, & f > \Omega_{p+1}. \end{cases} \qquad (17)$$

Each triangular filter is normalized such that the integral of each filter is 1. In addition, the filters overlap so that the frequency at which the filter H_p is maximum is the starting frequency for the next filter H_{p+1}, and the edge frequency of H_{p-1}.
Finally, the mel-spectrum (MFCC) coefficients of the n-th frame are defined for p = 1, …, N_B as

$$\text{MFCC}_n(p) = \sum_{k=1}^{K} H_p(k)\,|X_n(k)|^2, \qquad (18)$$

where the filter H_p is a discrete implementation of the continuous filter defined by (17). The discrete filter is normalized such that

$$\sum_{k=1}^{K} H_p(k) = 1.$$
Below is the code used to create the filter bank; with it, we are one step closer to classifying music by genre, since we can now group pitches much as the human ear does.
function [ fbank ] = melBank(~)
%melBank Creates a set of mel filter banks
% Implements the computation of the triangular filterbanks
% Hp, p = 1,...,NB. Returns an array fbank of size
% NB x K such that fbank(p,:) contains the filter bank Hp.
nbanks = 40; % Number of Mel frequency bands
% linear frequencies
N=512;
K=N/2+1;
fs=11025;
linFrq = 20:fs/2;
% mel frequencies
melFrq = log ( 1 + linFrq/700) *1127.01048;
% equispaced mel indices
melIdx = linspace(1,max(melFrq),nbanks+2);
% From mel index to linear frequency: melIdx2Frq(p) = \Omega_p
melIdx2Frq = zeros (1,nbanks+2);
for i=1:nbanks+2
[~, indx] = min(abs(melFrq - melIdx(i)));
melIdx2Frq(i) = linFrq(indx);
end
% map frequency banks
fbank_freq = linspace(0,floor(fs/2),K);
fbank=zeros(nbanks,K);
% Implement equation (17) for the mel filters
for i = 2:nbanks+1
range=2/(melIdx2Frq(i+1)-melIdx2Frq(i-1)); % peak height of triangle i
for j=1:K
filt_val=range;
if(fbank_freq(j) <= melIdx2Frq(i)) ...
&& (fbank_freq(j) > melIdx2Frq(i-1))
filt_val = filt_val*((fbank_freq(j) -melIdx2Frq(i-1)) ...
/(melIdx2Frq(i) - melIdx2Frq(i-1)));
fbank(i-1,j) = filt_val;
elseif(fbank_freq(j) < melIdx2Frq(i+1)) ...
&& (fbank_freq(j) >= melIdx2Frq(i))
filt_val = filt_val*((melIdx2Frq(i+1) - fbank_freq(j))...
/(melIdx2Frq(i+1) - melIdx2Frq(i)));
fbank(i-1,j) = filt_val;
else
fbank(i-1,j)=0;
end
end
end
% Normalize each discrete filter so that sum_k Hp(k) = 1
for i=1:nbanks
norm=sum(fbank(i,:));
for j=1:K
fbank(i,j)=fbank(i,j)/norm;
end
end
if(nargin) % pass any argument to also plot the filter bank
close all;
figure
plot(fbank.');
title('Mel Filter Bank');
xlabel('Frequency (Hz)');
ylabel('Filter Magnitude');
xlim([0,length(fbank)]);
saveas(gca,'melFilterBank.png');
close all;
end
end
Assignment
11. Implement the computation of the triangular filterbanks H_p, p = 1, . . . , N_B. Your function will return an array fbank of size N_B × K such that fbank(p,:) contains the filter bank H_p.
12. Implement the computation of the mfcc coefficients, as defined in (18).
Here is the code used to calculate the MFCC coefficients:
function [ mfccp ] = mfcc( fbank,Xn )
%mfcc Computes the Mel frequency coefficients
% The mel-spectrum (MFCC) coefficient of the n-th frame is defined for
% p = 1,...,NB by equation (18)
[a,~]=size(fbank); % a = number of banks (40)
[c,d]=size(Xn); % c = K (257), d = number of frames (1032)
mfccp=zeros(a,d);
for n=1:d % frames
for p=1:a % banks
mfcc_val=0; % reset the accumulator for each bank
for k=1:c
mfcc_val = mfcc_val + fbank(p,k)*abs(Xn(k,n))^2;
end
mfccp(p,n)=mfcc_val;
end
end
mfccp=10*log10(mfccp); % express in dB
end
Here is a plot of the Mel Filter Bank:
7) Loudness and Spectral Figures
[Loudness, ZCR, spectrogram, and spectral feature plots for all six tracks.]